D208 P.A. : TASK 1 MULTIPLE REGRESSION FOR PREDICTIVE MODELING

Link to the Panopto Video

https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=39419150-cc20-4a0e-8702-ad0b01351db1

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Part I: Research Question

A. Describe the purpose of this data analysis by doing the following:

1. Summarize one research question that is relevant to a real-world organizational situation captured in the data set you have selected and that you will answer using multiple regression.

The customer's 'MonthlyCharge' is arguably one of the most important features for any company, because the company's profit probably depends on increasing the monthly charge across all customers. It can be considered the second most important variable in the Churn data set (after the 'Churn' variable), and the most important continuous variable.

Multiple regression will be used to identify the factors that affect MonthlyCharge and to assess the significance of each of them.


2. Define the objectives or goals of the data analysis. Ensure that your objectives or goals are reasonable within the scope of the data dictionary and are represented in the available data.

The monthly charge of each customer depends mostly on the number and types of services provided, but some services may be more important and more significant than others. The objective of the data analysis is to identify these features and to test their interrelationships in order to build a multiple regression model that can predict the monthly charge of each customer from the available features. This gives stakeholders the insight to avoid the negative factors and to support the positive ones that may affect the monthly charge, and probably the company's profit.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Part II: Method Justification

B. Describe multiple regression methods by doing the following:

1. Summarize the assumptions of a multiple regression model.
  1. A linear relationship is assumed between the dependent variable and the independent variables. The exact parameters of this relationship cannot be known, but a linear regression model provides close estimates of them.
  2. Regression residuals are normally distributed, meaning that the residuals are concentrated symmetrically around the model line (with a mean of zero). When this holds, the model has captured the linear structure in the data; otherwise, the model is less descriptive.
  3. The residuals are homoscedastic, so a plot of the residuals is approximately rectangular. This means that the variance of the residuals is constant and the errors are caused by random factors that affect all predictions alike, without any significant difference.
  4. Absence of multicollinearity is expected in the model. Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model (https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/)
  5. No autocorrelation of the residuals, meaning that the errors are independent of each other.
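
As an illustration of the last assumption, autocorrelation of residuals is often summarized with the Durbin-Watson statistic. The following is a minimal sketch using NumPy only; the function name and the synthetic residuals are assumptions made for the example and are not taken from this notebook. Values near 2 suggest no first-order autocorrelation, while values near 0 suggest strong positive autocorrelation.

```python
import numpy as np


def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))


# Synthetic residuals (illustrative only, not the churn data).
rng = np.random.default_rng(0)
independent = rng.normal(size=2000)   # independent errors by construction
trending = np.cumsum(independent)     # random walk: strongly autocorrelated

dw_independent = durbin_watson(independent)
dw_trending = durbin_watson(trending)
print(round(dw_independent, 2), round(dw_trending, 4))
```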

After :


2. Describe the benefits of using the tool(s) you have chosen (i.e., Python, R, or both) in support of various phases of the analysis.

Python was selected. It is a general-purpose, interpreted, object-oriented language that supports many useful packages for creating linear models. The selected Python libraries include:


3. Explain why multiple regression is an appropriate technique to analyze the research question summarized in Part I.

The monthly charge of each customer depends mostly on the number and types of services provided, but some services may be more important and more significant than others. Multiple regression is an appropriate technique for identifying the factors that affect MonthlyCharge and the significance of each of them, because it models the relationship between multiple explanatory variables and a single continuous dependent variable ('MonthlyCharge'). More importantly, it can be used to predict the monthly charge of each customer based on the selected features.
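
To make the idea concrete, the sketch below fits a multiple regression by ordinary least squares with NumPy. The data are synthetic stand-ins (one continuous predictor and one binary "service" flag), and the variable names and coefficient values are assumptions chosen for the example, not results from the churn data.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Synthetic predictors (illustrative only).
x1 = rng.normal(size=n)                          # a continuous predictor
x2 = rng.integers(0, 2, size=n).astype(float)    # a binary service flag
y = 100.0 + 5.0 * x1 + 30.0 * x2 + rng.normal(scale=1.0, size=n)

# Design matrix with an intercept column, then the least-squares solution.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Coefficient of determination (R-squared) of the fitted model.
y_hat = X @ beta
ss_res = float(np.sum((y - y_hat) ** 2))
ss_tot = float(np.sum((y - y.mean()) ** 2))
r_squared = 1.0 - ss_res / ss_tot
print(beta, r_squared)
```

Each fitted coefficient estimates the change in the dependent variable for a one-unit change in that predictor while the other predictors are held constant.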

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Exploring the Data

Importing Libraries:

Reading the CSV Data:

Data info:

Summary statistics:

Data visualization:

*Function plt_summary(): inspects the data visually using histograms, boxplots and scatter plots. The function is defined to visually identify statistical parameters and to get a sense of the data, such as outliers, ranges, dominant values, etc.

Converting Binary categories into numeric:

*Function cat2num() : to convert categorical variables into serial numeric values in integer format
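
The notebook's own cat2num() code is not reproduced in this text. As a rough sketch of the kind of mapping described, the hypothetical helper below (named cat2num_sketch to avoid confusion with the original) assigns each distinct category an integer code in sorted order of the labels. The sorted ordering is an assumption.

```python
def cat2num_sketch(values):
    """Map each distinct category to an integer code (labels in sorted order)."""
    categories = sorted(set(values))
    mapping = {label: code for code, label in enumerate(categories)}
    return [mapping[v] for v in values], mapping


# Example with a binary Yes/No column (illustrative values).
codes, mapping = cat2num_sketch(["Yes", "No", "No", "Yes"])
print(codes, mapping)
```

For a binary Yes/No variable this yields No = 0 and Yes = 1, so the resulting regression coefficient can be read as the effect of having the service.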

Variables Correlation:

*Function plot_corr_ellipses(): imported and modified from the textbook "Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python" (Bruce et al., 2020), via the associated GitHub code repository:

https://github.com/gedeck/practical-statistics-for-data-scientists/blob/master/python/notebooks/Chapter%201%20-%20Exploratory%20Data%20Analysis.ipynb

PCA:

Conclusion of Data Exploration:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Part III: Data Preparation

C. Summarize the data preparation process for multiple regression analysis by doing the following:

1. Describe your data preparation goals and the data manipulations that will be used to achieve the goals.

data preparation goals:


2. Discuss the summary statistics, including the target variable and all predictor variables that you will need to gather from the data set to answer the research question.

3. Explain the steps used to prepare the data for the analysis, including the annotated code.

4. Generate univariate and bivariate visualizations of the distributions of variables in the cleaned data set. Include the target variable in your bivariate visualizations.

Preparing for a Test model

Standardization of numeric variables
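
A minimal sketch of standardization (z-scoring) with NumPy follows. The function name and the sample values are assumptions for illustration, and the choice of the sample standard deviation (ddof=1) is also an assumption; some libraries divide by the population standard deviation instead.

```python
import numpy as np


def standardize(columns):
    """Z-score each column: subtract its mean, divide by its sample std."""
    x = np.asarray(columns, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)


# Two illustrative numeric columns on very different scales.
raw = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]])
scaled = standardize(raw)
print(scaled)
```

After standardization the columns share a common scale, which makes the magnitudes of the coefficients in a test model comparable.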

Initializing Test model

Applying variance inflation factor test (VIF)
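
The variance inflation factor of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. The sketch below computes it with NumPy only; the function name and the synthetic columns are assumptions for the example and may differ from the implementation used in this notebook.

```python
import numpy as np


def vif(X):
    """VIF of each column: 1 / (1 - R^2) from regressing it on the others."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)


# Synthetic predictors: `c` is nearly a copy of `a`, `b` is unrelated.
rng = np.random.default_rng(1)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
c = a + 0.1 * rng.normal(size=1000)
vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity; the exact cutoff is a judgment call.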

5. Provide a copy of the prepared data set.

Right in the next section.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Part IV: Model Comparison and Analysis

D. Compare an initial and a reduced multiple regression model by doing the following:

  1. Construct an initial multiple regression model from all predictors that were identified in Part C2.

Initial model

Saving a copy of the prepared data set.

2. Justify a statistically based variable selection procedure and a model evaluation metric to reduce the initial model in a way that aligns with the research question.

Variable selection is based mainly on statistical significance (the P>|t| value) with a significance level of 0.05. For variables with a p-value below 0.05, the null hypothesis (that the coefficient is zero) is rejected, meaning that these variables are significant and relevant to the research question (prediction of MonthlyCharge). For variables with a p-value above 0.05, we fail to reject the null hypothesis, meaning that a coefficient of zero is plausible and excluding them from the model is a reasonable decision.

3. Provide a reduced multiple regression model that includes both categorical and continuous variables.

Note: The output should include a screenshot of each model.

Reduced model

Final model

The model is applied to unstandardized variables for more interpretable results.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

E. Analyze the data set using your reduced multiple regression model by doing the following:

1. Explain your data analysis process by comparing the initial and reduced multiple regression models, including the following elements:
• the logic of the variable selection technique

Variable selection is based mainly on statistical significance (the P>|t| value) with a significance level of 0.05. For variables with a p-value below 0.05, the null hypothesis (that the coefficient is zero) is rejected, meaning that these variables are significant and relevant to the research question (prediction of MonthlyCharge). For variables with a p-value above 0.05, we fail to reject the null hypothesis, meaning that a coefficient of zero is plausible and excluding them from the model is a reasonable decision.

The final set of variables was chosen through sequential iterations: running a model, eliminating non-significant features, rerunning the model, and so on until reaching the final model. Not all of the runs are included in this notebook, which contains only a test model, an initial model and a final reduced model.
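
The iteration described above is a form of backward elimination, and can be sketched as follows. This is a NumPy-only illustration under stated assumptions: the function names and synthetic data are hypothetical, and the p-values use a normal approximation to the t distribution (reasonable for large samples), so they will differ slightly from the P>|t| values in a regression summary table.

```python
import math
import numpy as np


def ols_p_values(X, y):
    """OLS fit (X includes an intercept column); returns coefficients, p-values."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    t = beta / se
    # Assumption: two-sided p-value from the normal approximation.
    p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in t])
    return beta, p


def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the least significant predictor until all p-values are below alpha."""
    names = list(names)
    cols = list(range(X.shape[1]))
    while cols:
        design = np.column_stack([np.ones(len(y)), X[:, cols]])
        _, p = ols_p_values(design, y)
        p_pred = p[1:]                    # skip the intercept
        worst = int(np.argmax(p_pred))
        if p_pred[worst] < alpha:
            break                         # every remaining predictor is significant
        cols.pop(worst)                   # remove it and refit
    return [names[c] for c in cols]


# Synthetic data: the third column has no effect on y.
rng = np.random.default_rng(7)
n = 2000
X = rng.normal(size=(n, 3))
y = 50.0 + 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(size=n)
kept = backward_eliminate(X, y, ["signal_a", "signal_b", "noise"])
print(kept)
```

Removing one variable per iteration and refitting matters because p-values can change once a correlated predictor leaves the model.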

• the model evaluation metric

The models (initial and reduced) were validated and compared through:

• a residual plot

Typical normal distribution of a discretized variable

2. Provide the output and any calculations of the analysis you performed, including the model’s residual error.

Note: The output should include the predictions from the refined model you used to perform the analysis.
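
One common way to report a model's residual error is the residual standard error, sqrt(SSE / (n - p - 1)), where p is the number of predictors excluding the intercept. The sketch below is illustrative only; the function name and the sample values are assumptions and are not output from this notebook's models.

```python
import numpy as np


def residual_standard_error(y, y_pred, n_predictors):
    """sqrt(SSE / (n - p - 1)), with p predictors excluding the intercept."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = float(np.sum((y - y_pred) ** 2))
    return (sse / (len(y) - n_predictors - 1)) ** 0.5


# Illustrative observed values and predictions from a one-predictor model.
rse = residual_standard_error(
    [3.0, 5.0, 7.0, 9.0], [2.0, 6.0, 6.0, 10.0], n_predictors=1
)
print(rse)
```

The result is expressed in the units of the dependent variable, which makes it easy to compare against typical values of the target.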

3. Provide the code used to support the implementation of the multiple regression models.

Included in this notebook

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Part V: Data Summary and Implications

F. Summarize your findings and assumptions by doing the following:

1. Discuss the results of your data analysis, including the following elements:
• a regression equation for the reduced model

• an interpretation of coefficients of the statistically significant variables of the model

All the variables represent paid services, so the coefficients are probably the average monthly payments for each of these services:

(Note: *there are some very minor differences in the coefficient values because the models were regenerated.)

etc...


• the statistical and practical significance of the model

• the limitations of the data analysis

2. Recommend a course of action based on your results.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Part VI: Demonstration

G. Provide a Panopto video recording that includes all of the following elements:

Link to the Panopto Video

https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=39419150-cc20-4a0e-8702-ad0b01351db1

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

H. List the web sources used to acquire data or segments of third-party code to support the application. Ensure the web sources are reliable.

I. Acknowledge sources, using in-text citations and references, for content that is quoted, paraphrased, or summarized.

J. Demonstrate professional communication in the content and presentation of your submission.


References: